## [1] 4e-04
## [1] 0.03358333

Not stationary (why?), so difference by 1 lag.

And if we difference by 2 lags:

Plot ACF and PACF

## [1] 0.92
## [1] "No"
## [1] 0.97
## [1] "Yes"

## [1] 0.98
## [1] "Yes"
## [1] 0.96
## [1] "Yes"

Let’s try different ARMA(p,q) models for Dx:

## initial  value -5.517013 
## iter   2 value -5.610226
## iter   3 value -5.639102
## iter   4 value -5.667613
## iter   5 value -5.680920
## iter   6 value -5.686832
## iter   7 value -5.688360
## iter   8 value -5.690604
## iter   9 value -5.691425
## iter  10 value -5.691657
## iter  11 value -5.691992
## iter  12 value -5.693876
## iter  13 value -5.694710
## iter  14 value -5.695386
## iter  15 value -5.695689
## iter  16 value -5.695873
## iter  17 value -5.695883
## iter  18 value -5.695895
## iter  19 value -5.695895
## iter  20 value -5.695895
## iter  20 value -5.695895
## iter  20 value -5.695895
## final  value -5.695895 
## converged
## initial  value -5.526162 
## iter   2 value -5.543360
## iter   3 value -5.552982
## iter   4 value -5.559251
## iter   5 value -5.561630
## iter   6 value -5.564113
## iter   7 value -5.565733
## iter   8 value -5.566584
## iter   9 value -5.567171
## iter  10 value -5.567757
## iter  11 value -5.567902
## iter  12 value -5.568102
## iter  13 value -5.568339
## iter  14 value -5.568562
## iter  15 value -5.568747
## iter  16 value -5.568992
## iter  17 value -5.569225
## iter  18 value -5.569312
## iter  19 value -5.569330
## iter  20 value -5.569338
## iter  21 value -5.569340
## iter  22 value -5.569340
## iter  23 value -5.569340
## iter  23 value -5.569340
## iter  23 value -5.569340
## final  value -5.569340 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = constant, optim.control = list(trace = trc, REPORT = 1, 
##     reltol = tol))
## 
## Coefficients:
##           ar1      ar2     ar3      ar4      ar5      ar6      ar7
##       -0.5590  -0.0396  0.0019  -0.0169  -0.0311  -0.1607  -0.1924
## s.e.   0.2401   0.1043  0.0997   0.1004   0.1035   0.0994   0.1110
##           ar8      ar9     ar10    ar11    ar12    ar13     ar14     ar15
##       -0.2988  -0.3405  -0.0914  0.0819  0.0875  0.0427  -0.1522  -0.2951
## s.e.   0.1086   0.1131   0.1076  0.1051  0.1045  0.1057   0.1058   0.1162
##          ar16     ma1  constant
##       -0.3723  0.6799    -1e-04
## s.e.   0.0968  0.2539     2e-04
## 
## sigma^2 estimated as 1.405e-05:  log likelihood = 473.15,  aic = -908.29
## 
## $degrees_of_freedom
## [1] 97
## 
## $ttable
##          Estimate     SE t.value p.value
## ar1       -0.5590 0.2401 -2.3277  0.0220
## ar2       -0.0396 0.1043 -0.3794  0.7053
## ar3        0.0019 0.0997  0.0194  0.9846
## ar4       -0.0169 0.1004 -0.1686  0.8665
## ar5       -0.0311 0.1035 -0.3009  0.7641
## ar6       -0.1607 0.0994 -1.6162  0.1093
## ar7       -0.1924 0.1110 -1.7332  0.0862
## ar8       -0.2988 0.1086 -2.7513  0.0071
## ar9       -0.3405 0.1131 -3.0105  0.0033
## ar10      -0.0914 0.1076 -0.8494  0.3978
## ar11       0.0819 0.1051  0.7787  0.4381
## ar12       0.0875 0.1045  0.8373  0.4045
## ar13       0.0427 0.1057  0.4036  0.6874
## ar14      -0.1522 0.1058 -1.4392  0.1533
## ar15      -0.2951 0.1162 -2.5393  0.0127
## ar16      -0.3723 0.0968 -3.8475  0.0002
## ma1        0.6799 0.2539  2.6776  0.0087
## constant  -0.0001 0.0002 -0.5220  0.6028
## 
## $AIC
## [1] -9.859923
## 
## $AICc
## [1] -9.772967
## 
## $BIC
## [1] -10.43028

Looking at the ACF for Dx, we note that there does appear to be seasonality. Each flu season has 23 weeks, and we observe spikes in the ACF every 23 weeks. So let’s try to account for seasonality.

## [1] 0.9333333
## [1] "No"
## [1] 0.9666667
## [1] "Yes"

Let’s try to fit a SARIMA model for D23x: Using good/bad to say whether final coefficients are significant. i.e. if the model is arima(2,1,3), I only need the AR(2) and MA(3) coefficients to be significant. Also, for better reasoning on choosing p,d,q,P,D,Q,S, look further down the page for where we fit a model to the data from vaccinated respondents.

## initial  value -5.536587 
## iter   2 value -5.678612
## iter   3 value -5.734583
## iter   4 value -5.752971
## iter   5 value -5.759835
## iter   6 value -5.760877
## iter   7 value -5.765845
## iter   8 value -5.772118
## iter   9 value -5.776981
## iter  10 value -5.780535
## iter  11 value -5.781611
## iter  12 value -5.783016
## iter  13 value -5.783849
## iter  14 value -5.784998
## iter  15 value -5.786078
## iter  16 value -5.786756
## iter  17 value -5.787928
## iter  18 value -5.788666
## iter  19 value -5.788950
## iter  20 value -5.789166
## iter  21 value -5.789552
## iter  22 value -5.789836
## iter  23 value -5.790100
## iter  24 value -5.790665
## iter  25 value -5.790914
## iter  26 value -5.791014
## iter  27 value -5.791061
## iter  28 value -5.791080
## iter  29 value -5.791092
## iter  30 value -5.791098
## iter  31 value -5.791104
## iter  32 value -5.791116
## iter  33 value -5.791131
## iter  34 value -5.791147
## iter  35 value -5.791158
## iter  36 value -5.791164
## iter  37 value -5.791170
## iter  38 value -5.791172
## iter  39 value -5.791172
## iter  39 value -5.791172
## iter  39 value -5.791172
## final  value -5.791172 
## converged
## initial  value -5.497575 
## iter   2 value -5.525397
## iter   3 value -5.545369
## iter   4 value -5.555749
## iter   5 value -5.555900
## iter   6 value -5.567166
## iter   7 value -5.570839
## iter   8 value -5.576464
## iter   9 value -5.579911
## iter  10 value -5.583920
## iter  11 value -5.593006
## iter  12 value -5.597103
## iter  13 value -5.597992
## iter  14 value -5.603064
## iter  15 value -5.605340
## iter  16 value -5.608235
## iter  17 value -5.608897
## iter  18 value -5.610248
## iter  19 value -5.610554
## iter  20 value -5.610790
## iter  21 value -5.610977
## iter  22 value -5.611261
## iter  23 value -5.611512
## iter  24 value -5.612009
## iter  25 value -5.612373
## iter  26 value -5.612564
## iter  27 value -5.612685
## iter  28 value -5.612775
## iter  29 value -5.612866
## iter  30 value -5.612999
## iter  31 value -5.613079
## iter  32 value -5.613271
## iter  33 value -5.613404
## iter  34 value -5.613458
## iter  35 value -5.613623
## iter  36 value -5.613690
## iter  37 value -5.613762
## iter  38 value -5.613815
## iter  39 value -5.613844
## iter  40 value -5.613870
## iter  41 value -5.613889
## iter  42 value -5.613902
## iter  43 value -5.613909
## iter  44 value -5.613912
## iter  45 value -5.613918
## iter  46 value -5.613922
## iter  47 value -5.613924
## iter  48 value -5.613925
## iter  49 value -5.613925
## iter  50 value -5.613925
## iter  50 value -5.613925
## final  value -5.613925 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), include.mean = !no.constant, optim.control = list(trace = trc, 
##     REPORT = 1, reltol = tol))
## 
## Coefficients:
##           ar1      ar2      ar3     ar4     ar5      ar6      ar7      ar8
##       -0.0448  -0.0710  -0.0447  0.4365  0.5544  -0.0948  -0.0417  -0.1566
## s.e.   0.4117   0.1974   0.1616  0.1531  0.3045   0.1371   0.1456   0.1322
##           ar9    ar10    ar11    ar12    ar13     ar14     ar15     ar16
##       -0.1812  0.1396  0.1659  0.0306  0.0580  -0.0612  -0.1571  -0.0416
## s.e.   0.1443  0.1416  0.1865  0.1431  0.1419   0.1371   0.1355   0.1259
##         ar17     ar18      ma1     ma2     ma3      ma4      ma5     sma1
##       0.2554  -0.1647  -0.1702  0.0199  0.2131  -0.5105  -0.5523  -0.2570
## s.e.  0.1236   0.1870   0.4253  0.2957  0.1834   0.2070   0.3702   0.1608
## 
## sigma^2 estimated as 1.117e-05:  log likelihood = 381.74,  aic = -713.49
## 
## $degrees_of_freedom
## [1] 91
## 
## $ttable
##      Estimate     SE t.value p.value
## ar1   -0.0448 0.4117 -0.1089  0.9135
## ar2   -0.0710 0.1974 -0.3597  0.7199
## ar3   -0.0447 0.1616 -0.2769  0.7825
## ar4    0.4365 0.1531  2.8511  0.0054
## ar5    0.5544 0.3045  1.8207  0.0719
## ar6   -0.0948 0.1371 -0.6913  0.4911
## ar7   -0.0417 0.1456 -0.2863  0.7753
## ar8   -0.1566 0.1322 -1.1849  0.2391
## ar9   -0.1812 0.1443 -1.2557  0.2124
## ar10   0.1396 0.1416  0.9862  0.3267
## ar11   0.1659 0.1865  0.8895  0.3761
## ar12   0.0306 0.1431  0.2136  0.8313
## ar13   0.0580 0.1419  0.4087  0.6837
## ar14  -0.0612 0.1371 -0.4460  0.6566
## ar15  -0.1571 0.1355 -1.1594  0.2493
## ar16  -0.0416 0.1259 -0.3303  0.7419
## ar17   0.2554 0.1236  2.0660  0.0417
## ar18  -0.1647 0.1870 -0.8804  0.3809
## ma1   -0.1702 0.4253 -0.4001  0.6900
## ma2    0.0199 0.2957  0.0672  0.9465
## ma3    0.2131 0.1834  1.1617  0.2484
## ma4   -0.5105 0.2070 -2.4659  0.0155
## ma5   -0.5523 0.3702 -1.4916  0.1393
## sma1  -0.2570 0.1608 -1.5981  0.1135
## 
## $AIC
## [1] -9.985156
## 
## $AICc
## [1] -9.84075
## 
## $BIC
## [1] -10.4123

The very good model was best.

## $pred
## Time Series:
## Start = 116 
## End = 125 
## Frequency = 1 
##  [1] 0.02040377 0.01367033 0.01651095 0.02295284 0.02232263 0.02037957
##  [7] 0.01966316 0.01978917 0.02059040 0.02122259
## 
## $se
## Time Series:
## Start = 116 
## End = 125 
## Frequency = 1 
##  [1] 0.003419422 0.004374180 0.005043979 0.005982023 0.006725938
##  [6] 0.007285459 0.007552876 0.007843906 0.008100484 0.008197078

Let’s try to fit a model omitting the final 2 values, then predict the next 2 values and compare to the actual ones

## $pred
## Time Series:
## Start = 111 
## End = 115 
## Frequency = 1 
## [1] 0.022127733 0.019324457 0.014278120 0.010703089 0.005199534
## 
## $se
## Time Series:
## Start = 111 
## End = 115 
## Frequency = 1 
## [1] 0.003414511 0.004343294 0.005065274 0.006066540 0.006882498
## [1] 0.030 0.029 0.025 0.022 0.020

Do the actual 114th and 115th values of the time series lie within the 95% bounds of our predicted 114th and 115th values? (Make this look nicer)

##              [,1]  [,2]       [,3]
## [1,]  0.015435292 0.030 0.02882017
## [2,]  0.010811601 0.029 0.02783731
## [3,]  0.004350183 0.025 0.02420606
## [4,] -0.001187331 0.022 0.02259351
## [5,] -0.008290162 0.020 0.01868923

Yes (above shows the lower CI, actual value, upper CI) Plotting actual vs predicted:

As you can see from this plot, the next two values the time series (red line) predictsare very close to the actual values (black line). Certainly, the actual values lie well within the 95% confidence bounds (dotted red lines).

Note: We can also predict value for Oct 16, 2016, which we have the data for but omitted so that we had 23 weeks per season.

Pre-whitening: (NB: This did not work. Do we still include? Maybe in report, not poster.)

Let’s take a look at the Google Trends data:

Not stationary. Try differencing.

Fit an ARMA model to the differenced data.

## [1] 0.9823009
## [1] "Yes"
## [1] 0.9823009
## [1] "Yes"
## initial  value 0.828263 
## iter   2 value 0.814788
## iter   3 value 0.778813
## iter   4 value 0.777724
## iter   5 value 0.776237
## iter   6 value 0.773120
## iter   7 value 0.770630
## iter   8 value 0.767976
## iter   9 value 0.766622
## iter  10 value 0.765858
## iter  11 value 0.765474
## iter  12 value 0.764534
## iter  13 value 0.762100
## iter  14 value 0.760753
## iter  15 value 0.759372
## iter  16 value 0.757814
## iter  17 value 0.757157
## iter  18 value 0.754637
## iter  19 value 0.749358
## iter  20 value 0.749193
## iter  21 value 0.748115
## iter  22 value 0.747874
## iter  23 value 0.747792
## iter  24 value 0.747757
## iter  25 value 0.747755
## iter  26 value 0.747754
## iter  27 value 0.747754
## iter  27 value 0.747754
## iter  27 value 0.747754
## final  value 0.747754 
## converged
## initial  value 0.753023 
## iter   2 value 0.751907
## iter   3 value 0.750952
## iter   4 value 0.749253
## iter   5 value 0.748278
## iter   6 value 0.747631
## iter   7 value 0.747253
## iter   8 value 0.742399
## iter   9 value 0.741944
## iter  10 value 0.741476
## iter  11 value 0.741391
## iter  12 value 0.741246
## iter  13 value 0.741176
## iter  14 value 0.740979
## iter  15 value 0.740830
## iter  16 value 0.740762
## iter  17 value 0.740756
## iter  18 value 0.740756
## iter  18 value 0.740756
## iter  18 value 0.740756
## final  value 0.740756 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), include.mean = !no.constant, optim.control = list(trace = trc, 
##     REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ar1      ar2      ma1     ma2
##       0.2938  -0.9191  -0.4733  1.0000
## s.e.  0.0382   0.0381   0.0270  0.0527
## 
## sigma^2 estimated as 4.205:  log likelihood = -246.21,  aic = 502.41
## 
## $degrees_of_freedom
## [1] 111
## 
## $ttable
##     Estimate     SE  t.value p.value
## ar1   0.2938 0.0382   7.6856       0
## ar2  -0.9191 0.0381 -24.1349       0
## ma1  -0.4733 0.0270 -17.5042       0
## ma2   1.0000 0.0527  18.9879       0
## 
## $AIC
## [1] 2.505817
## 
## $AICc
## [1] 2.527995
## 
## $BIC
## [1] 1.601293
## initial  value 0.828263 
## iter   2 value 0.775973
## iter   3 value 0.773992
## iter   4 value 0.773947
## iter   5 value 0.773947
## iter   5 value 0.773947
## iter   5 value 0.773947
## final  value 0.773947 
## converged
## initial  value 0.773687 
## iter   2 value 0.773685
## iter   3 value 0.773684
## iter   3 value 0.773684
## iter   3 value 0.773684
## final  value 0.773684 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), include.mean = !no.constant, optim.control = list(trace = trc, 
##     REPORT = 1, reltol = tol))
## 
## Coefficients:
##           ar1      ar2
##       -0.2344  -0.2692
## s.e.   0.0911   0.0914
## 
## sigma^2 estimated as 4.691:  log likelihood = -249.96,  aic = 505.92
## 
## $degrees_of_freedom
## [1] 113
## 
## $ttable
##     Estimate     SE t.value p.value
## ar1  -0.2344 0.0911 -2.5737  0.0114
## ar2  -0.2692 0.0914 -2.9446  0.0039
## 
## $AIC
## [1] 2.580526
## 
## $AICc
## [1] 2.599798
## 
## $BIC
## [1] 1.628264

Similar AICc, but we want a model with no MA terms. So use the 2nd.

##        ar1        ar2 
## -0.2343787 -0.2691571

Filter the time series x using the model for y

Google Trends and Flutracking data do not appear to be correlated, so we shouldn’t use the Google Trends data to predict the Flutracking data.

We should also try fitting a model to the vaccinated survey respondents (i.e. % ofvaccinated survey respondents who report flu-related symptoms during a given week).

## [1] 3e-04
## [1] 0.02883333

Not stationary (why?), so difference by one lag

There is an outlier. Difference by 2 lags?

Same outlier. We will use Dz for now, but this is bad.

## [1] 0.93
## [1] "No"
## [1] 0.95
## [1] "Yes"

There is still a seasonal effect every 23 weeks. So let’s account for seasonality.

## [1] 0.9555556
## [1] "Yes"
## [1] 0.9666667
## [1] "Yes"

Looking at the seasons (every 23 lags), the ACF has a spike at 23 (i.e. Q=1). We can say either the ACF tails off, in which case we can choose a value for P (say P=1, as there is a spike in the PACF at 23); or we can say the ACF cuts off, in which case P=0 by definition. Looking within each 23 week season, we can say that: the ACF cuts off at 1 (q=1 & p=0); we can say the ACF trails off with a spike at 1 and the PACF trails off with a spike at 1 (q=1 & p=1); we can say that the PACF cuts off at 1 (p=1 & q=0). Also consider possible spikes in the ACF and PACF at lag 9.

Cases where P=0 and Q=1

## initial  value -5.664793 
## iter   2 value -5.743848
## iter   3 value -5.748846
## iter   4 value -5.749018
## iter   5 value -5.749020
## iter   5 value -5.749020
## iter   5 value -5.749020
## final  value -5.749020 
## converged
## initial  value -5.761476 
## iter   2 value -5.768288
## iter   3 value -5.774187
## iter   4 value -5.774277
## iter   5 value -5.774288
## iter   5 value -5.774288
## iter   5 value -5.774288
## final  value -5.774288 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), include.mean = !no.constant, optim.control = list(trace = trc, 
##     REPORT = 1, reltol = tol))
## 
## Coefficients:
##           ma1     sma1
##       -0.2184  -0.6634
## s.e.   0.0875   0.2352
## 
## sigma^2 estimated as 8.367e-06:  log likelihood = 396.34,  aic = -786.67
## 
## $degrees_of_freedom
## [1] 113
## 
## $ttable
##      Estimate     SE t.value p.value
## ma1   -0.2184 0.0875 -2.4959  0.0140
## sma1  -0.6634 0.2352 -2.8206  0.0057
## 
## $AIC
## [1] -10.65648
## 
## $AICc
## [1] -10.63721
## 
## $BIC
## [1] -11.60875
## initial  value -5.666641 
## iter   2 value -5.753119
## iter   3 value -5.757444
## iter   4 value -5.757500
## iter   4 value -5.757500
## iter   4 value -5.757500
## final  value -5.757500 
## converged
## initial  value -5.769069 
## iter   2 value -5.777173
## iter   3 value -5.783975
## iter   4 value -5.784106
## iter   5 value -5.784122
## iter   6 value -5.784122
## iter   6 value -5.784122
## final  value -5.784122 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), include.mean = !no.constant, optim.control = list(trace = trc, 
##     REPORT = 1, reltol = tol))
## 
## Coefficients:
##           ar1     sma1
##       -0.2828  -0.6942
## s.e.   0.1005   0.2545
## 
## sigma^2 estimated as 8.062e-06:  log likelihood = 397.23,  aic = -788.46
## 
## $degrees_of_freedom
## [1] 113
## 
## $ttable
##      Estimate     SE t.value p.value
## ar1   -0.2828 0.1005 -2.8139  0.0058
## sma1  -0.6942 0.2545 -2.7276  0.0074
## 
## $AIC
## [1] -10.69359
## 
## $AICc
## [1] -10.67432
## 
## $BIC
## [1] -11.64585

Cases where P=1 and Q=1

## $pred
## Time Series:
## Start = 116 
## End = 125 
## Frequency = 1 
##  [1] 0.01576768 0.01586229 0.01998301 0.01999209 0.02184931 0.02258010
##  [7] 0.02148729 0.02312899 0.02208322 0.02407620
## 
## $se
## Time Series:
## Start = 116 
## End = 125 
## Frequency = 1 
##  [1] 0.002879329 0.003534874 0.004205850 0.004753199 0.005251460
##  [6] 0.005704361 0.006124389 0.006517262 0.006887800 0.007239387

Let’s try to fit a model omitting the final 2 values, then predict the next 2 values and compare to the actual ones

## $pred
## Time Series:
## Start = 96 
## End = 115 
## Frequency = 1 
##  [1] 0.02039634 0.02362814 0.02379617 0.02237217 0.02370805 0.02305938
##  [7] 0.02507894 0.02461871 0.02352787 0.02425673 0.02708805 0.02954827
## [13] 0.03370264 0.03512297 0.03097724 0.02989696 0.02906867 0.02189291
## [19] 0.01863522 0.01597923
## 
## $se
## Time Series:
## Start = 96 
## End = 115 
## Frequency = 1 
##  [1] 0.003129429 0.003771994 0.004492737 0.005061512 0.005587611
##  [6] 0.006063709 0.006506443 0.006920485 0.007311249 0.007682121
## [11] 0.008035908 0.008374758 0.008700423 0.009014329 0.009317667
## [16] 0.009611435 0.009896487 0.010173556 0.010443276 0.010706202
##  [1] 0.017 0.018 0.019 0.018 0.020 0.017 0.020 0.021 0.022 0.019 0.021
## [12] 0.020 0.023 0.023 0.025 0.024 0.023 0.021 0.018 0.018

Do the actual 114th and 115th values of the time series lie within the 95% bounds of our predicted 114th and 115th values? (Make this look nicer)

##               [,1]  [,2]       [,3]
##  [1,]  0.014262655 0.017 0.02653002
##  [2,]  0.016235032 0.018 0.03102125
##  [3,]  0.014990407 0.019 0.03260194
##  [4,]  0.012451608 0.018 0.03229274
##  [5,]  0.012756331 0.020 0.03465977
##  [6,]  0.011174512 0.017 0.03494425
##  [7,]  0.012326316 0.020 0.03783157
##  [8,]  0.011054559 0.021 0.03818286
##  [9,]  0.009197817 0.022 0.03785791
## [10,]  0.009199774 0.019 0.03931369
## [11,]  0.011337673 0.021 0.04283843
## [12,]  0.013133747 0.020 0.04596280
## [13,]  0.016649809 0.023 0.05075547
## [14,]  0.017454887 0.023 0.05279106
## [15,]  0.012714613 0.025 0.04923987
## [16,]  0.011058547 0.024 0.04873537
## [17,]  0.009671557 0.023 0.04846579
## [18,]  0.001952743 0.021 0.04183308
## [19,] -0.001833599 0.018 0.03910404
## [20,] -0.005004931 0.018 0.03696338

Yes (above shows the lower CI, actual value, upper CI) Plotting actual vs predicted:

As you can see from this plot, the next two values the time series (red line) predicts are close to the actual values (black line). The predicted values are slighly lower than expected, but the actual values do lie within the 95% confidence bounds (dotted red lines).